South Park is an American TV show. It is well known for being very satirical. Pretty much every famous person has already been made fun of in the series. I literally watch it every day! I also do lots of analyses in R every day. I just thought to myself, why haven’t I analysed South Park texts yet? And that’s when I decided to combine two things I am passionate about. Read on to see how easy it was!

So I have the idea, but where do I start?

First things first. I had to find a resource with all the text in a reasonable format. It took just a bit of Googling to find a South Park gold mine! I typed "South Park scripts" into Google and the very first link was exactly what I was looking for: South Park Archives, a page with community-maintained scripts for all episodes! Isn’t that great?

You can find a list of seasons on that page. And after clicking on a season, an episode list comes up. An episode page contains a nice table with two columns. The first column is a character name. And the second column is the actual line that character said. That’s a perfect start.
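In R terms, a table like that maps naturally onto a data frame with one row per spoken line. Here is a minimal sketch of that shape. The rows are made-up placeholders, and the column names `character` and `text` are my assumption, not necessarily what southparkr uses.

```r
# Made-up rows for illustration only; these are not real script lines.
# The column names `character` and `text` are assumptions.
episode_lines <- data.frame(
  character = c("Stan", "Kyle", "Stan", "Cartman"),
  text = c("First line.", "Second line.", "Third line.", "Fourth line."),
  stringsAsFactors = FALSE
)

# Number of lines spoken by each character, most talkative first.
lines_per_character <- sort(table(episode_lines$character), decreasing = TRUE)
print(lines_per_character)
```

With one row per line, questions such as "who talks the most?" become a one-liner.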

There was one last thing I wanted to know about each episode: its popularity! I’m sure you know IMDB, the Internet Movie Database. It contains ratings for movies and TV shows alike.

But how to put it all together? I wrote a simple R package called southparkr that anyone can use to run their own analyses!

Data acquired. BINGO! Let’s dig in.

The second step was to decide what exactly I wanted to analyse. I settled on two things:

  1. Sentiment analysis of episodes,
  2. Episode popularity based on IMDB ratings.

We’ll get to that in a minute. First, let’s look at the data we acquired. The following table summarises all episodes in a few numbers.

Number of seasons: 21
Number of episodes: 287
Number of words: 907 797
Number of words excluding stop words (a, the, this, …): 310 759
% of words used for analysis: 34.23
Average IMDB rating: 8.14
Best episode (9.6): Scott Tenorman Must Die S05E04
Worst episode (6.3): Funnybot S15E02

You can see that the show has been on for 21 seasons already. All the characters combined have said almost 1 million words! That is if we count all words. If we exclude stop words, we end up with about 310 thousand words. Stop words are prepositions, articles and other very common words that carry little meaning on their own.
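To make the stop-word step concrete, here is a minimal base R sketch. The stop-word list and the sentence are made up for illustration, and this is not necessarily how southparkr does it. The last lines recompute the percentage from the two totals in the summary table.

```r
# Tiny hand-picked stop-word list, for illustration only.
stop_words <- c("a", "the", "this", "is", "of", "to", "and")

# Made-up sentence; not a real script line.
line <- "This is the story of a boy and the dog"

# Lower-case, split on anything that is not a letter or apostrophe, drop empties.
words <- strsplit(tolower(line), "[^a-z']+")[[1]]
words <- words[words != ""]

# Keep only the words that are not stop words.
kept <- words[!words %in% stop_words]
print(kept)

# Share of words kept, using the totals from the summary table.
total_words <- 907797
words_without_stop_words <- 310759
pct_used <- round(100 * words_without_stop_words / total_words, 2)
print(pct_used)
```

Only three of the ten words in the toy sentence survive, which shows why the word count drops so sharply once stop words are removed.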

All the episodes sustain an average rating of roughly 8.1, which is great! It seems that the show is popular. I always consider anything rated above 8 very watchable! You can also see the best and the worst episode. So in case you don’t know the show, the best-rated episode is where you might start. It is almost guaranteed that you won’t be disappointed.

Let’s dig deeper and get sentimental!

For background on sentiment analysis, see https://en.wikipedia.org/wiki/Sentiment_analysis
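Before plotting, here is a rough sketch of how a per-episode value such as `mean_sentiment_score` could be computed with a word-level lexicon. The lexicon, its scores and the episode words below are all made up for illustration. This is one common approach, not necessarily the exact method behind the chart.

```r
# Hypothetical mini lexicon: each word gets a made-up positive or negative score.
lexicon <- data.frame(
  word = c("great", "love", "happy", "bad", "hate", "awful"),
  score = c(3, 3, 3, -3, -3, -3),
  stringsAsFactors = FALSE
)

# Made-up words, already stripped of stop words, tagged with an episode number.
episode_words <- data.frame(
  episode_number = c(1, 1, 1, 2, 2, 2),
  word = c("great", "love", "bad", "hate", "awful", "happy"),
  stringsAsFactors = FALSE
)

# Attach a score to every word that appears in the lexicon.
scored <- merge(episode_words, lexicon, by = "word")

# Average the word scores within each episode.
by_episode <- aggregate(score ~ episode_number, data = scored, FUN = mean)
names(by_episode)[names(by_episode) == "score"] <- "mean_sentiment_score"
print(by_episode)
```

Words missing from the lexicon simply drop out in the merge, so the mean reflects only words that carry a score.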

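library(ggplot2)
library(plotly)

# Mean sentiment per episode with a smoothed trend; `text` feeds the plotly tooltip.
gg <- ggplot(by_episode, aes(episode_number, mean_sentiment_score, group = 1, text = text_sent)) +
    geom_col(color = "#592a88") +
    geom_smooth()

# Make the plot interactive and show only the custom tooltip text.
ggplotly(gg, tooltip = "text")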

Conclusion